Overview of data
In the previous lesson we applied linear regression to make quantitative predictions. In this lesson, we will learn how a related technique, logistic regression, can be used to make class or category predictions. In its most basic form, this type of prediction is binary, meaning it has only two options: yes (1) or no (0); disease or no disease, etc. Using the same core data set from the previous lesson, we will attempt to classify children with chronic kidney disease (CKD) by stage: stage 2 vs stage 3b. The iGFRc column has been removed for this lesson because CKD stage is determined directly from it.
Our data is provided in two files. One has values for the outcome (Stage) for each subject ID and the other includes values for several predictors (e.g., creatinine, BUN, various endogenous metabolites) measured for each subject ID.
We will need to use our previously learned skills to read in the data and join the two sets by subject.
#load in CKD_data.csv and CKD_stage.csv
data <- read_csv("data/CKD_data.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## SCr = col_double(),
## BUN = col_double(),
## CYC_DB = col_double(),
## Albumin = col_double(),
## uPCratio = col_double(),
## ADMA = col_double(),
## SDMA = col_double(),
## Creatinine = col_double(),
## Kynurenine = col_double(),
## Trp = col_double(),
## Phe = col_double(),
## Tyr = col_double()
## )
glimpse(data)
## Observations: 200
## Variables: 13
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ SCr <dbl> 1.93, 2.72, 0.88, 0.78, 1.28, 1.41, 0.58, 1.34, 3.42,…
## $ BUN <dbl> 21, 37, 12, 27, 27, 19, 17, 13, 45, 39, 80, 52, 41, 2…
## $ CYC_DB <dbl> 0.82, 1.90, 0.65, 1.31, 1.50, 1.60, 1.04, 1.04, 3.04,…
## $ Albumin <dbl> 4.1, 4.2, 4.2, 4.4, 3.6, 4.2, 4.5, 3.8, 4.1, 4.0, 4.2…
## $ uPCratio <dbl> 1.39, 0.38, 0.33, 0.26, 0.57, 0.04, 0.12, 5.17, 0.26,…
## $ ADMA <dbl> 0.41, 0.69, 0.57, 0.39, 0.87, 0.48, 0.52, 1.14, 0.46,…
## $ SDMA <dbl> 0.70, 1.45, 0.49, 0.33, 2.31, 0.74, 0.46, 1.42, 0.96,…
## $ Creatinine <dbl> 138.61, 274.34, 78.81, 63.05, 226.21, 150.50, 70.84, …
## $ Kynurenine <dbl> 3.98, 7.35, 1.76, 2.41, 5.91, 4.29, 3.19, 6.18, 8.08,…
## $ Trp <dbl> 78.18, 45.02, 58.62, 34.64, 62.36, 52.21, 49.28, 101.…
## $ Phe <dbl> 79.45, 110.28, 74.47, 39.81, 75.40, 58.66, 53.82, 93.…
## $ Tyr <dbl> 77.00, 79.61, 65.22, 44.55, 94.24, 60.66, 68.07, 110.…
stage <- read_csv("data/CKD_stage.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## Stage = col_character()
## )
glimpse(stage)
## Observations: 200
## Variables: 2
## $ id <dbl> 498, 128, 13, 2, 183, 197, 78, 174, 91, 123, 168, 41, 130,…
## $ Stage <chr> "CKD3b", "CKD3b", "CKD3b", "CKD3b", "CKD3b", "CKD3b", "CKD…
#join by ID, convert ID and Stage variables to factors
ckd <- left_join(stage, data, by = "id") %>%
mutate(id = factor(id),
Stage = factor(Stage))
## left_join: added 12 columns (SCr, BUN, CYC_DB, Albumin, uPCratio, …)
## > rows only in x 0
## > rows only in y ( 0)
## > matched rows 200
## > =====
## > rows total 200
## mutate: converted 'id' from double to factor (0 new NA)
## converted 'Stage' from character to factor (0 new NA)
glimpse(ckd)
## Observations: 200
## Variables: 14
## $ id <fct> 498, 128, 13, 2, 183, 197, 78, 174, 91, 123, 168, 41,…
## $ Stage <fct> CKD3b, CKD3b, CKD3b, CKD3b, CKD3b, CKD3b, CKD3b, CKD3…
## $ SCr <dbl> 2.80, 3.14, 4.09, 2.72, 4.49, 3.86, 2.65, 3.06, 3.23,…
## $ BUN <dbl> 44, 45, 41, 37, 53, 44, 37, 52, 34, 47, 51, 47, 38, 6…
## $ CYC_DB <dbl> 2.63, 2.50, 2.97, 1.90, 4.94, 2.93, 2.25, 3.15, 3.04,…
## $ Albumin <dbl> 3.4, 4.2, 3.1, 4.2, 3.4, 4.1, 3.9, 4.5, 3.2, 3.8, 4.8…
## $ uPCratio <dbl> 6.79, 2.01, 1.50, 0.38, 1.11, 0.29, 7.52, 0.04, 14.23…
## $ ADMA <dbl> 0.99, 0.66, 0.77, 0.69, 0.93, 0.85, 0.66, 0.82, 0.75,…
## $ SDMA <dbl> 2.33, 1.68, 1.96, 1.45, 2.22, 2.45, 1.19, 1.77, 1.43,…
## $ Creatinine <dbl> 308.30, 293.63, 521.61, 274.34, 519.03, 512.06, 290.7…
## $ Kynurenine <dbl> 5.57, 8.72, 6.55, 7.35, 4.51, 7.47, 11.15, 9.95, 4.77…
## $ Trp <dbl> 36.89, 42.22, 45.49, 45.02, 47.73, 49.58, 50.33, 79.2…
## $ Phe <dbl> 73.72, 74.61, 116.18, 110.28, 98.99, 82.66, 85.27, 12…
## $ Tyr <dbl> 45.39, 51.82, 66.85, 79.61, 91.04, 68.90, 41.97, 94.2…
#how many subjects do we have? how many variables? how many subjects in each class?
Quick EDA
Let’s look at the summary statistics. We also need to be aware of any class bias, a situation where one class is over-represented in the data. Class imbalance can create problems for modeling; the ideal state is for the classes to be balanced. There are several ways to handle class imbalance, but they are outside the scope of this course. For this activity, we have provided data that is balanced.
prop.table(table(ckd$Stage))
##
## CKD2 CKD3b
## 0.5 0.5
summary(ckd)
## id Stage SCr BUN CYC_DB
## 1 : 1 CKD2 :100 Min. :0.350 Min. : 5.0 Min. :0.620
## 2 : 1 CKD3b:100 1st Qu.:0.930 1st Qu.:17.0 1st Qu.:1.020
## 3 : 1 Median :1.330 Median :29.0 Median :1.300
## 4 : 1 Mean :1.560 Mean :29.7 Mean :1.493
## 5 : 1 3rd Qu.:1.975 3rd Qu.:40.0 3rd Qu.:1.855
## 6 : 1 Max. :4.490 Max. :80.0 Max. :4.940
## (Other):194
## Albumin uPCratio ADMA SDMA
## Min. :2.600 Min. : 0.0000 Min. :0.160 Min. :0.180
## 1st Qu.:3.800 1st Qu.: 0.1775 1st Qu.:0.480 1st Qu.:0.560
## Median :4.200 Median : 0.4800 Median :0.620 Median :0.845
## Mean :4.138 Mean : 1.8991 Mean :0.624 Mean :1.001
## 3rd Qu.:4.500 3rd Qu.: 1.7800 3rd Qu.:0.750 3rd Qu.:1.380
## Max. :5.400 Max. :32.8700 Max. :1.540 Max. :2.840
##
## Creatinine Kynurenine Trp Phe
## Min. : 36.50 Min. : 1.080 Min. : 20.25 Min. : 29.56
## 1st Qu.: 98.82 1st Qu.: 3.510 1st Qu.: 49.91 1st Qu.: 69.24
## Median :137.25 Median : 4.525 Median : 63.41 Median : 83.56
## Mean :166.06 Mean : 5.074 Mean : 66.71 Mean : 84.17
## 3rd Qu.:226.30 3rd Qu.: 6.543 3rd Qu.: 79.50 3rd Qu.: 98.84
## Max. :521.61 Max. :12.350 Max. :152.22 Max. :151.26
##
## Tyr
## Min. : 30.39
## 1st Qu.: 62.45
## Median : 77.16
## Mean : 80.77
## 3rd Qu.: 99.19
## Max. :176.77
##
We can create boxplots for each variable, filled by Stage, to see if there are differences in distributions across the classes. This may provide clues about which variables may be good predictors for CKD Stage.
grpData <- gather(ckd, variable, value, 3:14)
## gather: reorganized (SCr, BUN, CYC_DB, Albumin, uPCratio, …) into (variable, value) [was 200x14, now 2400x4]
ggplot(grpData, aes(factor(variable), value, fill = Stage)) +
geom_boxplot() + facet_wrap(~variable, scale="free")

From the boxplots, it looks like we have several candidate predictors for Stage, some of which should be familiar and obvious to you. Collinearity is a problem for logistic regression and must be addressed in multivariate models. As before, we should have an idea of how the predictors correlate with each other.
cors <- ckd %>%
select(-id, -Stage) %>%
cor(use = 'pairwise.complete.obs')
## select: dropped 2 variables (id, Stage)
corrplot(cors, type="lower", method="circle", addCoef.col="black",
number.cex=0.45, tl.cex = 0.55, tl.col = "black",
col=brewer.pal(n=8, name="RdBu"), diag=FALSE)

The correlations range from low to high and in both directions. This is something we need to consider as we select predictors for our model.
Logistic regression
We can think about the probability or likelihood of a binary outcome as being between 0 and 1. Since the values of the outcome are limited to 0 through 1, we don’t apply standard linear regression. If we tried, the fit could be problematic and even produce impossible values (i.e., values < 0 or > 1). We need a model that restricts predictions to the range 0 through 1. Logistic regression is one such model.
Instead of selecting coefficients that minimize the squared error terms from the best-fit line, as in linear regression, logistic regression selects coefficients to maximize the likelihood of the observed data: predicting a high probability for observations that actually belong to class 1 and a low probability for observations that actually belong to class 0.
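To make "maximize the likelihood" concrete, here is a small sketch, using made-up classes and probabilities (not our CKD data), comparing the binomial log-likelihood of confident, correct predictions against indifferent ones:

```r
# Log-likelihood of observed classes y (0/1) under predicted probabilities p:
# the sum over observations of y*log(p) + (1 - y)*log(1 - p)
loglik <- function(y, p) sum(y * log(p) + (1 - y) * log(1 - p))

y      <- c(1, 1, 0, 0)          # actual classes (hypothetical)
good_p <- c(0.9, 0.8, 0.2, 0.1)  # high p for class 1, low p for class 0
flat_p <- rep(0.5, 4)            # no discrimination between classes

loglik(y, good_p) > loglik(y, flat_p)  # TRUE: the confident, correct fit scores higher
```

Fitting searches over coefficient values for the set that makes this quantity as large as possible.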
Assumptions of logistic regression:
- The outcome is a binary or dichotomous variable, like yes vs no, positive vs negative, 1 vs 0.
- There is a linear relationship between the logit of the outcome and each predictor variable. The logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome.
- There are no influential values (extreme values or outliers) in the continuous predictors.
- There are no high intercorrelations (i.e., multicollinearity) among the predictors.
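The logit mentioned in the assumptions is available in base R as qlogis(), with plogis() as its inverse; a quick sketch:

```r
# logit(p) = log(p / (1 - p)) maps probabilities in (0, 1) onto the whole
# real line (the log-odds scale); plogis() is its inverse
p <- c(0.1, 0.5, 0.9)
log_odds <- qlogis(p)   # identical to log(p / (1 - p))
log_odds
plogis(log_odds)        # recovers the original probabilities
```

Note that a probability of 0.5 corresponds to log-odds of 0, which is why 0 on the logit scale marks the default decision boundary.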
Similar to the previous lesson, we will split the data, fit a model, and then examine the model output on train and test data. In this case, we will use the glm function, which is commonly used for fitting Generalized Linear Models, of which logistic regression is one form. We specify logistic regression with the argument family = "binomial". This returns an object of class “glm”, which inherits from the class “lm”. Therefore, it also includes attributes we can explore to learn about our model and its fit of our data.
A major difference is that logistic regression does not directly return an observation’s class; it returns the estimated probability of class membership. This probability ranges from 0 to 1, and assignment to a class is based on a threshold. The default threshold is 0.5, but it can be adjusted depending on the purpose of the prediction. Both simple and multivariate versions of logistic regression are possible. Since we explored that difference with linear regression, we will start this lesson with the multivariate model we ended with in the previous lesson.
Split the data
The data was provided to you after processing and cleaning, so we are able to skip these critical steps for this lesson. We start our modeling process by splitting our data into 75:25 train:test sets.
set.seed(439) #so we all get same random numbers
train <- sample(nrow(ckd), nrow(ckd) * 0.75)
test <- -train
ckd_train <- ckd[train, ] %>%
select(-id)
## select: dropped one variable (id)
ckd_test <- ckd[test, ] %>%
select(-id)
## select: dropped one variable (id)
Fit the model
We will fit a new model, modGLM, that uses SCr, BUN, and Kynurenine to predict Stage in the training set. As before, we will add the predicted probability values to the training set as a new variable, Stage_prob. The contrasts function shows which level R treats as the reference (0) and which as the event (1) for the prediction.
contrasts(ckd$Stage) #what is R considering the reference? CKD3b: 0 = N, 1 = Y
## CKD3b
## CKD2 0
## CKD3b 1
modGLM <- glm(Stage ~ SCr + BUN + Kynurenine, data = ckd_train, family = "binomial")
# what is in .fitted? Log odds.
head(augment(modGLM))
## # A tibble: 6 x 11
## Stage SCr BUN Kynurenine .fitted .se.fit .resid .hat .sigma
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CKD2 1.36 23 3.74 -2.15 0.584 -0.469 0.0318 0.573
## 2 CKD3b 2.17 41 9.68 6.12 1.44 0.0662 0.00450 0.574
## 3 CKD3b 2.69 47 4.07 6.49 1.44 0.0550 0.00313 0.574
## 4 CKD2 0.64 12 5.06 -5.67 1.15 -0.0828 0.00449 0.574
## 5 CKD2 1.34 13 6.18 -4.74 1.13 -0.132 0.0111 0.574
## 6 CKD3b 1.8 33 4.03 1.46 0.572 0.646 0.0500 0.572
## # … with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
# If we want probabilities for comparison, then we need to predict the train using type = "response"
#add the predicted values to the train set and set the type argument to response
ckd_train <- ckd_train %>%
mutate(Stage_prob = predict(modGLM, ckd_train, type = "response"))
## mutate: new variable 'Stage_prob' with 150 unique values and 0% NA
#using threshold 0.5, convert probabilities to predicted stage
ckd_train$Stage_pred<- ifelse(ckd_train$Stage_prob > 0.5, "CKD3b", "CKD2")
How did the model do at predicting stage in our training data? We can calculate the accuracy of the model and plot the density of the predicted probabilities by class.
#calculate accuracy, if == statement is TRUE, value = 1, otherwise = 0
mean(ckd_train$Stage_pred == ckd_train$Stage)
## [1] 0.9266667
ggplot(ckd_train, aes(Stage_prob, color = Stage)) +
geom_density()

Examining our model
Recalling the helpful functions from the broom package, we can examine our model. The parameters reported for a logistic regression model differ from those we saw in the previous lesson on linear regression. R2 is not relevant for logistic regression. Instead, to compare models, we rely on the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion). Each balances model fit against parsimony, and each penalizes the number of parameters differently. Models with the lowest AIC and lowest BIC are preferred.
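As a sketch of what these criteria measure, their definitions (AIC = 2k − 2·logLik, BIC = k·log(n) − 2·logLik, where k is the number of estimated parameters and n the number of observations) can be verified on a small simulated glm; we use toy data here rather than modGLM so the exercise below stays unspoiled:

```r
# Toy logistic model on simulated data (hypothetical, not our CKD model)
set.seed(1)
toy <- data.frame(y = rbinom(50, 1, 0.5), x = rnorm(50))
m <- glm(y ~ x, data = toy, family = "binomial")

k <- length(coef(m))  # parameters estimated: intercept + slope
n <- nobs(m)          # observations used in the fit

all.equal(AIC(m), 2 * k - 2 * as.numeric(logLik(m)))      # TRUE
all.equal(BIC(m), k * log(n) - 2 * as.numeric(logLik(m))) # TRUE
```

Because BIC’s penalty grows with log(n), it tends to favor smaller models than AIC on large data sets.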
Exercise 1:
Examine modGLM using the glance() and tidy() functions of the broom package. What is the AIC and BIC for this model? What are the coefficients for each term of the model?
End exercise
Examining collinearity
As mentioned before, we need to be careful when several predictors have strong correlation. Remember that we can calculate the variance inflation factor (VIF) for each predictor to determine how much the variance of its regression coefficient is inflated due to multicollinearity in the model. We want VIF values close to 1 (meaning no multicollinearity) and less than 5.
vif(modGLM)
## SCr BUN Kynurenine
## 1.119578 1.124932 1.005801
There does not seem to be a collinearity problem in our model.
Making predictions from our model
When we use the predict function on this model, it predicts the log(odds) of the Y variable. This is not what we ultimately want, since we want to determine the predicted Stage. To convert the output into prediction probabilities bounded between 0 and 1, we specify type = "response".
#predict on test
table(ckd_test$Stage) #CKD3b ~ 50%
##
## CKD2 CKD3b
## 25 25
ckd_test$Stage_prob <- predict(modGLM, ckd_test, type = "response")
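The two prediction scales are connected by the inverse logit: applying plogis() to the default (link-scale) predictions reproduces what type = "response" returns. A self-contained sketch on simulated data (not modGLM):

```r
# Toy logistic model (hypothetical data) to show the two predict() scales
set.seed(2)
d <- data.frame(y = rbinom(40, 1, 0.5), x = rnorm(40))
m <- glm(y ~ x, data = d, family = "binomial")

log_odds <- predict(m)                     # link scale: log-odds, unbounded
probs    <- predict(m, type = "response")  # probability scale: in (0, 1)

all.equal(unname(plogis(log_odds)), unname(probs))  # TRUE
```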
With the predicted probabilities, we can now apply a threshold and assign each row to either the CKD3b or CKD2 class, based on probability. We will start with a threshold of 0.5. We know the actual assignment from the Stage column of this test data, so we can calculate the accuracy of our model’s class predictions.
ckd_test$Stage_pred<- ifelse(ckd_test$Stage_prob > 0.5, "CKD3b", "CKD2")
mean(ckd_test$Stage_pred == ckd_test$Stage) #0.9
## [1] 0.9
Exercise 2:
Select a different threshold and determine the accuracy of the model for that threshold setting.
End exercise
Build ROC curve as alternative to accuracy
Sometimes accuracy alone is not enough to judge model performance (especially when there is class imbalance, where accuracy can be misleading), and a threshold of 0.5 may not be optimal. We can use the pROC package functions to build an ROC curve, find the area under the curve (AUC), and view the effects of changing the cutoff value on model performance.
# Convert Stage to numeric probability variable (0 or 1)
ckd_test <- ckd_test %>%
mutate(Stage_num = ifelse(Stage == "CKD3b", 1, 0))
## mutate: new variable 'Stage_num' with 2 unique values and 0% NA
# Create a ROC curve object from columns of actual and predicted probabilities
ROC <- roc(ckd_test$Stage_num, ckd_test$Stage_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve object
ggroc(ROC, alpha = 0.5, colour = "blue")

# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.9776
As expected, we were able to build a strong classifier model. Most real-world situations have less separation between classes than we found in this lesson. In those cases, one must consider the purpose of the classifier and weigh the importance of false positives versus false negatives. The ROC curve is helpful for finding the optimal cutoff in those cases. Additional calculation of a confusion matrix to determine the sensitivity and specificity of the model would also be warranted.
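As a starting point, a confusion matrix and the sensitivity/specificity calculations can be sketched in base R; the labels below are made up for illustration, not the ckd_test predictions:

```r
# Cross-tabulate predicted vs actual classes (hypothetical labels)
actual    <- factor(c("CKD2", "CKD2", "CKD3b", "CKD3b", "CKD3b", "CKD2"))
predicted <- factor(c("CKD2", "CKD3b", "CKD3b", "CKD3b", "CKD2", "CKD2"))

cm <- table(Predicted = predicted, Actual = actual)
cm

# Treating CKD3b as the "positive" class:
sens <- cm["CKD3b", "CKD3b"] / sum(cm[, "CKD3b"])  # true positive rate
spec <- cm["CKD2",  "CKD2"]  / sum(cm[, "CKD2"])   # true negative rate
c(sensitivity = sens, specificity = spec)
```

Packages such as caret provide a confusionMatrix function that reports these and related statistics in one call.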